Add DSv4 FP8 H200 SGLang MTP benchmark by functionstackx · Pull Request #1265 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-02T22:16:15Z

Summary

Adds dsv4-fp8-h200-sglang-mtp, the MTP variant of dsv4-fp8-h200-sglang (sglang dsv4-pro hopper (rebased on main, with --disable-radix-cache) #1264).
Same image (lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc…), same h200-dgxc runner pool, same search space (TP=8 EP=1, conc 1 and 4-64 for 1k1k and 8k1k) — search-space entries gain spec-decoding: mtp.
New launcher script benchmarks/single_node/dsv4_fp8_h200_sglang_mtp.sh mirrors the non-MTP one with the EAGLE speculative-decoding flags appended:
- --speculative-algorithm EAGLE
- --speculative-num-steps 3
- --speculative-eagle-topk 1
- --speculative-num-draft-tokens 4
The (3,1,4) chain matches the existing dsv4-fp4-b300-sglang-mtp TP-only path.
Per the AGENTS.md rule for *_mtp.sh scripts, run_benchmark_serving is invoked with --dsv4 so prompts are chat-formatted (the canonical DSv4-Pro tokenizer ships no jinja chat template, so plain --use-chat-template would crash; --dsv4 routes through encoding_dsv4.py from [DSv4] add jinja chat template support #1153).
Script lookup resolves correctly via the framework-tagged + _mtp suffix logic that already landed in launch_h200-dgxc-slurm.sh from sglang dsv4-pro hopper (rebased on main, with --disable-radix-cache) #1264.
perf-changelog.yaml updated; PR-link backfilled to Add DSv4 FP8 H200 SGLang MTP benchmark #1265.

Test plan

CI sweep on the new dsv4-fp8-h200-sglang-mtp recipe lands on h200-dgxc and produces results
Verify EAGLE acceptance rate is reasonable (compare vs non-MTP TPS at matching concurrencies)
Confirm the (3,1,4) EAGLE chain is the right starting point on H200 (vs. e.g. (1,1,2) at higher concurrencies)

🤖 Generated with Claude Code

Mirror of dsv4-fp8-h200-sglang plus EAGLE speculative decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4). The (3,1,4) chain matches the dsv4-fp4-b300-sglang-mtp TP-only path. Same image, runner pool (h200-dgxc), and search space as the non-MTP entry. The launcher resolves the new spec-decoding: mtp matrix entries to benchmarks/single_node/dsv4_fp8_h200_sglang_mtp.sh via the framework-tagged + _mtp suffix lookup that landed with #1264. run_benchmark_serving uses --dsv4 (DSv4-Pro chat framing) per the AGENTS.md rule that all MTP scripts must benchmark against chat-formatted inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-02T22:16:23Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-02T22:16:23Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-02T22:27:40Z

+    - "EAGLE speculative decoding chain: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4"
+    - "run_benchmark_serving uses --dsv4 (chat-formatted prompts) per the AGENTS.md MTP rule, since EAGLE acceptance regresses on raw random tokens"
+    - "Search space mirrors the non-MTP H200 SGLang entry: TP=8 EP=1, conc 1 and 4-64 for both 1k1k and 8k1k, with spec-decoding: mtp"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265


🟡 The new dsv4-fp8-h200-sglang-mtp entry (perf-changelog.yaml:2124) has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the "XXX" placeholder was never replaced with this PR's real number (#1265). Despite the PR description claiming the link was backfilled, the committed file still has the placeholder; please update it to /pull/1265 to match the convention used by every other entry in the file.

Extended reasoning...

What the bug is. The new entry added by this PR for the dsv4-fp8-h200-sglang-mtp config in perf-changelog.yaml ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. "XXX" is clearly a templated placeholder — every one of the ~150 other entries in this same file uses a concrete PR number, and the PR's own description even claims "PR-link backfilled to #1265". The backfill never happened.\n\nHow it manifests. Anything that consumes perf-changelog.yaml and follows pr-link will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is not a valid PR. GitHub renders this as a 404. Any internal changelog tooling, dashboard, or script that crawls these links to surface release notes will silently produce a broken hyperlink for this one entry.\n\nStep-by-step proof. (1) The PR description states "perf-changelog.yaml updated; PR-link backfilled to #1265." (2) The pre-loaded modified-files content for perf-changelog.yaml literally ends with the line pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. (3) Independently confirmed by running git show HEAD:perf-changelog.yaml | tail -1 against commit 2f28e59 — it returns pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. (4) The PR's own number is #1265 (per the metadata at the top of the timeline), and the immediately-prior entry in the same file correctly uses /pull/1264. The intended value is unambiguously 1265.\n\nAddressing the refutation. A verifier objected that get_pr_diff shows + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265 and concluded the merged result will be correct. That is contradicted by directly inspecting the committed tree: git show HEAD:perf-changelog.yaml on the merge candidate (2f28e59) shows /pull/XXX, not /pull/1265. Whatever the diff-fetching tool returned does not match what is actually on the branch — the on-disk file and the committed object both carry the placeholder. Since GitHub merges what's in the tree, not a synthesized diff, the placeholder is what will land on main if this PR is merged as-is.\n\nWhy existing review didn't catch it. It's a one-line change at the very tail of a 2000+ line YAML file, and the surrounding lines look intentional and well-formed. The PR description even asserts the backfill was done, which discourages a closer look. There's no schema check on pr-link values, so no CI signal.\n\nImpact and severity. No runtime impact — perf-changelog.yaml is documentation, not consumed by the benchmark pipeline. The blast radius is limited to whatever tooling renders this changelog. This is a trivial one-character fix (XXX → 1265), and easy to make before merging.\n\nHow to fix. Replace the last line of perf-changelog.yaml with:\n\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265\n

github-actions · 2026-05-02T23:22:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25263257378
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25263257378

github-actions · 2026-05-03T00:33:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25264520177
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25264520177

AGENTS.md requires new perf-changelog entries to be appended to the end of the file (oldest at top, newest at bottom). The original commit prepended the new entry above PR #95; move it after the current last entry (PR #1265) to satisfy the convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…1267) * Add B300 config: kimi-k2.5-int4-vllm (vLLM 0.20.0 + TP=4/EP=1 sweep) - New `kimik2.5-int4-b300-vllm` config with the corresponding `benchmarks/single_node/kimik2.5_int4_b300.sh` launch script (mirrors the existing INT4 B200 vLLM recipe; the upstream vLLM Kimi-K2.5 recipes page does not yet ship B300-specific tuning). - Image: `vllm/vllm-openai:v0.20.0-cu130` — the original draft (#1057, reverted in #1070, reopened as #1071) carried `v0.19.0` while we waited on a working release; 0.20.0 has now shipped. - Search-space per (ISL, OSL): the existing TP=8 sweep plus a new TP=4 / EP=1 entry covering the lower-TP / expert-parallel variant on the same B300 nodes. Supersedes #1071 — opening fresh from main since the merge base had drifted (b200 schema migrated from `seq-len-configs` to `scenarios.fixed-seq-len`) and the user preferred a clean reopen over a rebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf-changelog: move kimik2.5-int4-b300-vllm entry to bottom AGENTS.md requires new perf-changelog entries to be appended to the end of the file (oldest at top, newest at bottom). The original commit prepended the new entry above PR #95; move it after the current last entry (PR #1265) to satisfy the convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add DSv4 FP8 H200 SGLang MTP benchmark Mirror of dsv4-fp8-h200-sglang plus EAGLE speculative decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4). The (3,1,4) chain matches the dsv4-fp4-b300-sglang-mtp TP-only path. Same image, runner pool (h200-dgxc), and search space as the non-MTP entry. The launcher resolves the new spec-decoding: mtp matrix entries to benchmarks/single_node/dsv4_fp8_h200_sglang_mtp.sh via the framework-tagged + _mtp suffix lookup that landed with SemiAnalysisAI#1264. run_benchmark_serving uses --dsv4 (DSv4-Pro chat framing) per the AGENTS.md rule that all MTP scripts must benchmark against chat-formatted inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf-changelog: fill in PR link for dsv4-fp8-h200-sglang-mtp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…emiAnalysisAI#1267) * Add B300 config: kimi-k2.5-int4-vllm (vLLM 0.20.0 + TP=4/EP=1 sweep) - New `kimik2.5-int4-b300-vllm` config with the corresponding `benchmarks/single_node/kimik2.5_int4_b300.sh` launch script (mirrors the existing INT4 B200 vLLM recipe; the upstream vLLM Kimi-K2.5 recipes page does not yet ship B300-specific tuning). - Image: `vllm/vllm-openai:v0.20.0-cu130` — the original draft (SemiAnalysisAI#1057, reverted in SemiAnalysisAI#1070, reopened as SemiAnalysisAI#1071) carried `v0.19.0` while we waited on a working release; 0.20.0 has now shipped. - Search-space per (ISL, OSL): the existing TP=8 sweep plus a new TP=4 / EP=1 entry covering the lower-TP / expert-parallel variant on the same B300 nodes. Supersedes SemiAnalysisAI#1071 — opening fresh from main since the merge base had drifted (b200 schema migrated from `seq-len-configs` to `scenarios.fixed-seq-len`) and the user preferred a clean reopen over a rebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf-changelog: move kimik2.5-int4-b300-vllm entry to bottom AGENTS.md requires new perf-changelog entries to be appended to the end of the file (oldest at top, newest at bottom). The original commit prepended the new entry above PR SemiAnalysisAI#95; move it after the current last entry (PR SemiAnalysisAI#1265) to satisfy the convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx requested a review from a team May 2, 2026 22:16

functionstackx requested review from jgangani and kedarpotdar-nv as code owners May 2, 2026 22:16

github-project-automation Bot added this to InferenceMAX Board May 2, 2026

perf-changelog: fill in PR link for dsv4-fp8-h200-sglang-mtp

7f0d072

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx added the sweep-enabled label May 2, 2026

claude Bot reviewed May 2, 2026

View reviewed changes

functionstackx added full-sweep-enabled and removed sweep-enabled labels May 2, 2026

functionstackx merged commit 224aa29 into main May 3, 2026
84 of 85 checks passed

functionstackx deleted the claude/dsv4-fp8-h200-sglang-mtp branch May 3, 2026 00:36

github-project-automation Bot moved this to Done in InferenceMAX Board May 3, 2026

claude Bot mentioned this pull request May 3, 2026

Add B300 config: kimi-k2.5-int4-vllm (vLLM 0.20.0 + TP=4/EP=1 sweep) #1267

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DSv4 FP8 H200 SGLang MTP benchmark#1265

Add DSv4 FP8 H200 SGLang MTP benchmark#1265
functionstackx merged 2 commits intomainfrom
claude/dsv4-fp8-h200-sglang-mtp

functionstackx commented May 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

claude Bot May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

functionstackx commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

claude Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

functionstackx commented May 2, 2026 •

edited

Loading